
Tokenization2Arrow - New Transform to tokenize data and generate .arrow and metadata files #1033


Merged (13 commits, Mar 6, 2025)

Conversation

santoshborse
Collaborator

Why are these changes needed?

Tokenization2Arrow - New Transform to tokenize data and generate .arrow and metadata files

Related issue number (if any).

#1009

@santoshborse santoshborse force-pushed the tokenization2arrow_transform branch from 4231f1d to b0d1490 Compare February 10, 2025 19:30
@touma-I touma-I requested a review from cmadam February 10, 2025 19:38
@shahrokhDaijavad
Collaborator

Thank you very much, @santoshborse. I tested the transform by running make run-cli-sample-python successfully.
A few things are still missing, such as a test directory that we use for CI/CD testing, and I will make the README more consistent with the other READMEs. For now, one small change in the README:

make run-cli-sample => make run-cli-sample-python

There is also an urgent request for a simple notebook that @Hajar-Emami can mimic in her GneissWeb recipe notebook. An example of such a minimal notebook is this one:
https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/tokenization/tokenization.ipynb
i.e.,

  1. Show the CLI option table that you have in the README
  2. from dpk_tokenization2arrow.transform_python import Tokenization2Arrow
  3. Tokenization2Arrow(input_folder= "test-data/ds02/input",
     output_folder= "output",
     .... other CLI arguments).transform()
  4. import glob
     glob.glob("output/*")

@cmadam (Collaborator) left a comment

One problem with this transform is that it has no tests:

$ make test-src
...
source venv/bin/activate;       \
export PYTHONPATH=../src:../: ;  \
cd test; pytest -s .
/bin/bash: line 3: cd: test: No such file or directory
==================== test session starts ====================
platform linux -- Python 3.11.11, pytest-8.3.4, pluggy-1.5.0
rootdir: /home/cma/de/data-prep-kit/transforms
configfile: pyproject.toml
plugins: cov-6.0.0, anyio-4.8.0
collected 0 items

=================== no tests ran in 0.01s ===================
make: *** [../../../.make.defaults:442: .defaults.test-src] Error 5

Please add some tests to the transform.
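A minimal, self-contained sketch of the kind of unit test such a test folder could hold. The `fake_transform` stand-in below only mimics the input-folder/output-folder contract and is not the real Tokenization2Arrow transform; a real test would invoke the transform (or the DPK test harness) instead.

```python
import glob
import os
import tempfile


def fake_transform(input_folder: str, output_folder: str) -> None:
    # Stand-in for Tokenization2Arrow(...).transform(): writes one .arrow
    # file per input file (file contents are elided in this sketch).
    os.makedirs(output_folder, exist_ok=True)
    for name in os.listdir(input_folder):
        stem, _ = os.path.splitext(name)
        with open(os.path.join(output_folder, stem + ".arrow"), "wb") as f:
            f.write(b"")


def test_outputs_arrow_files():
    # Arrange a temporary input folder with one sample file, run the
    # transform, and assert that .arrow output was produced.
    with tempfile.TemporaryDirectory() as tmp:
        inp = os.path.join(tmp, "in")
        out = os.path.join(tmp, "out")
        os.makedirs(inp)
        open(os.path.join(inp, "sample.parquet"), "wb").close()
        fake_transform(inp, out)
        assert glob.glob(os.path.join(out, "*.arrow"))
```

Dropping a file like this into a `test/` directory lets `make test-src` (which runs `pytest` there) collect and run it.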


Collaborator

For consistency, it may be better to use ray.transform. This will make the module easier to maintain.

Collaborator Author

I think because we have transform_python, which we use as
from dpk_tokenization2arrow.transform_python import Tokenization2Arrow

it makes more sense to have transform_ray, but if you insist I will change it.

@touma-I (Collaborator) commented Mar 3, 2025

@santoshborse The latest convention is to use dpk_<transform>.runtime (instead of dpk_<transform>.transform_python) and dpk_<transform>.ray.runtime. It would be nice if you can stay with the convention. It makes support easier, and there does not seem to be a good reason not to.

# TODO: check if we should add anything to tokenization_metadata
return [(bos.getvalue().to_pybytes(), ".arrow")], tokenization_metadata

def transform_binary(self, file_name: str, byte_array: bytes) -> tuple[list[tuple[bytes, str]], dict[str, Any]]:
Collaborator

Not clear why this is implementing transform_binary() and not transform(). I think the code structure will be easier to understand/maintain if you implement transform() and then call super().transform() before calling transforms_to_arrow().

Collaborator Author

transform_binary returns tuple[list[tuple[bytes, str]], dict[str, Any]]
transform returns tuple[list[pa.Table], dict[str, Any]]

I am using transform_binary so that I can return the data as bytes (so that the framework can write the .arrow files).

@touma-I (Collaborator) left a comment

Please provide one or more unit tests in the test folder.

Collaborator

Let's discuss how we can redo this one. Maybe two requirements.txt files: one used as part of the packaging, and one used for pulling the dependency on tokenization. Also, have you considered making this module part of the tokenization module? Would it be easier, for inheritance, for this module to be an extension of tokenization rather than its own module?

@touma-I (Collaborator) left a comment

Added the module to pyproject.toml so it is included when building the wheel.

@shahrokhDaijavad
Copy link
Collaborator

Thanks for adding the notebook, @santoshborse!

Collaborator

@santoshborse This file is missing the following:
TRANSFORM_PYTHON_SRC=
TRANSFORM_RAY_SRC=

See https://github.com/IBM/data-prep-kit/blob/dev/transforms/Makefile.transform.template for an example. This is also related to the comment above on transform_python and transform_ray. Let's follow the convention, since there is no really good reason not to.

Collaborator Author

Hi @touma-I, I have updated the module names and the Makefile as you asked.

Collaborator

@santoshborse Thank you! This looks good. Sorry about the confusion. I am hoping in the next iteration we will simplify things further and get rid of a few constraints. Please stay tuned. I might also reach out to bounce off a few ideas.

@touma-I (Collaborator) left a comment

@santoshborse The two additional comments I added should address the failed CI/CD. Please reach out if anything is not clear.

santoshborse and others added 10 commits March 5, 2025 14:25
Signed-off-by: Maroun Touma <[email protected]>
Signed-off-by: Santosh Borse <[email protected]>
Signed-off-by: Santosh Borse <[email protected]>
Signed-off-by: Santosh Borse <[email protected]>
Signed-off-by: Santosh Borse <[email protected]>
Signed-off-by: Maroun Touma <[email protected]>
Signed-off-by: Santosh Borse <[email protected]>
Signed-off-by: Santosh Borse <[email protected]>
Signed-off-by: Santosh Borse <[email protected]>

@touma-I (Collaborator) commented Mar 5, 2025

@shahrokhDaijavad Can you review the readme.md and notebooks before I merge? Thanks

@shahrokhDaijavad (Collaborator) left a comment

@santoshborse In the two notebooks, can you please change
from dpk_tokenization2arrow.transform_python import Tokenization2Arrow
to
from dpk_tokenization2arrow.runtime import Tokenization2Arrow

and
from dpk_tokenization2arrow.transform_ray import Tokenization2Arrow
to
from dpk_tokenization2arrow.ray.runtime import Tokenization2Arrow

Also, in the Makefile, don't we need to make the same change for the two targets:
dpk_$(TRANSFORM_NAME).transform_python => dpk_$(TRANSFORM_NAME).runtime
and
dpk_$(TRANSFORM_NAME).transform_ray => dpk_$(TRANSFORM_NAME).ray.runtime

After these changes, if you don't mind, I will make a few small changes in the README file myself.

@santoshborse
Collaborator Author

@shahrokhDaijavad Updated the notebook and Makefile; I also tested locally to make sure the notebook and both Makefile commands work.

Signed-off-by: Santosh Borse <[email protected]>
@touma-I touma-I self-requested a review March 6, 2025 18:31
Signed-off-by: SHAHROKH DAIJAVAD <[email protected]>
@shahrokhDaijavad (Collaborator) left a comment

Thanks, @santoshborse. I made a few changes to the README file.

@touma-I touma-I merged commit 261c925 into dev Mar 6, 2025
7 checks passed
@santoshborse santoshborse deleted the tokenization2arrow_transform branch May 28, 2025 19:31